
Migrate CitationMixin cache from CSV to SQLite#24

Merged
wenh06 merged 3 commits into master from feature/sqlite-cache
Feb 22, 2026
Conversation


@wenh06 wenh06 commented Feb 22, 2026

  • Changed backend storage from CSV to SQLite for improved performance and concurrency support.
  • Implemented automatic backward-compatible migration: existing CSV caches will be converted to SQLite on first use.
  • Updated get_citation and update_cache to use SQL queries for efficient data retrieval.
  • Updated tests to verify SQLite cache functionality and migration logic.

Copilot AI review requested due to automatic review settings February 22, 2026 07:52
@wenh06 wenh06 enabled auto-merge February 22, 2026 07:52
@wenh06 wenh06 self-assigned this Feb 22, 2026
@wenh06 wenh06 added the enhancement New feature or request label Feb 22, 2026

codecov bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 98.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.85%. Comparing base (99d3c16) to head (874db75).
⚠️ Report is 4 commits behind head on master.

Files with missing lines        Patch %   Lines
bib_lookup/citation_mixin.py    98.75%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   97.04%   97.85%   +0.80%     
==========================================
  Files           8        8              
  Lines        1254     1306      +52     
==========================================
+ Hits         1217     1278      +61     
+ Misses         37       28       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copilot AI left a comment

Pull request overview

This PR migrates CitationMixin’s on-disk citation cache from a CSV file to a SQLite database, aiming to improve lookup/update performance and better handle concurrent access, with a one-time automatic migration from legacy CSV caches.

Changes:

  • Replaced CSV cache read/write logic in CitationMixin.get_citation() / CitationMixin.update_cache() with SQLite queries and inserts.
  • Added _init_db() to initialize the SQLite schema and migrate an existing CSV cache on first use.
  • Updated the citation mixin test to seed/verify cache state using SQLite, and updated the changelog entry.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File                            Description
bib_lookup/citation_mixin.py    Implements SQLite-backed cache, adds CSV→SQLite migration, updates cache read/write paths.
test/test_citation_mixin.py     Updates the existing test to write/read the cache via SQLite instead of CSV.
CHANGELOG.rst                   Documents the cache backend change and migration behavior.


Comment on lines 112 to 121
# Use placeholders for the IN clause
placeholders = ",".join("?" * len(doi))

if format is not None and format != self._bl.format:
    citation = ""  # no cache for format other than bibtex
    existing_dois = set()
    cached_citations = []
else:
    citation = "\n".join(df_cc[df_cc["doi"].isin(doi)]["citation"].tolist())
    doi = [item for item in doi if item not in df_cc["doi"].tolist()]
    if print_result:
        query = f"SELECT citation FROM citations WHERE doi IN ({placeholders})"
        cursor.execute(query, doi)
Copilot AI Feb 22, 2026

The IN (...) query uses one bound parameter per DOI. SQLite has a fairly low default maximum for host parameters (commonly 999), so get_citation() can start failing for large DOI lists (regression vs the previous pandas-based filtering). Consider chunking doi into batches and querying/inserting per batch, or using a temporary table to join against.
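
The chunked lookup Copilot suggests might look like this. The table and column names are assumptions carried over from the diff, and the batch size of 900 is simply chosen to stay below SQLite's common default host-parameter limit of 999:

```python
import sqlite3

SQLITE_MAX_PARAMS = 900  # stay below SQLite's common default limit of 999


def fetch_citations(conn: sqlite3.Connection, dois: list) -> dict:
    """Fetch cached citations as {doi: citation}, querying in batches
    so large DOI lists never exceed the host-parameter limit."""
    result = {}
    for start in range(0, len(dois), SQLITE_MAX_PARAMS):
        chunk = dois[start : start + SQLITE_MAX_PARAMS]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT doi, citation FROM citations WHERE doi IN ({placeholders})",
            chunk,
        )
        result.update(dict(rows))
    return result
```

DOIs with no cache entry simply do not appear in the returned dict, which also gives the caller the "remaining DOIs to look up" set for free.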

Comment on lines 64 to 74
# Back up existing cache if any
if bib_lookup.CitationMixin.citation_cache.exists():
    df_system_cache = pd.read_csv(bib_lookup.CitationMixin.citation_cache)
    df_system_cache = _read_cache_db(bib_lookup.CitationMixin.citation_cache)
else:
    df_system_cache = None
df_cache_before_update.to_csv(bib_lookup.CitationMixin.citation_cache, index=False)

# Write initial cache state for test
if bib_lookup.CitationMixin.citation_cache.exists():
    bib_lookup.CitationMixin.citation_cache.unlink()

_write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)
Copilot AI Feb 22, 2026

This test reads/writes CitationMixin.citation_cache under ~/.cache/bib-lookup, which can delete or overwrite a developer’s real cache when running tests locally. Prefer using tmp_path (or similar) and monkeypatching CitationMixin.citation_cache_db/citation_cache_csv (or CACHE_DIR) to an isolated temp directory for the duration of the test.

Comment on lines 70 to 77
# Write initial cache state for test
if bib_lookup.CitationMixin.citation_cache.exists():
    bib_lookup.CitationMixin.citation_cache.unlink()

_write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)

obj = SomeClass()
obj.update_cache()
Copilot AI Feb 22, 2026

The PR introduces automatic CSV→SQLite migration in CitationMixin._init_db(), but the updated test only exercises the SQLite cache path. Add a focused test that creates a legacy bib-lookup-cache.csv, calls get_citation()/update_cache(), and asserts the DB is populated correctly and the CSV is removed (or retained on failure).

Comment on lines 50 to 54
    # Delete CSV after successful migration
    self.citation_cache_csv.unlink()
    print(f"Migrated citation cache from CSV to SQLite: {self.citation_cache_db}")
except Exception as e:
    print(f"Failed to migrate CSV cache: {e}")
Copilot AI Feb 22, 2026

_init_db() prints migration success/failure unconditionally. This will write to stdout for any library user calling get_citation()/update_cache(), which is a behavior change and makes it hard to use the library in apps that capture stdout. Prefer using warnings.warn(...) or the project’s verbosity/logging mechanism, and only emit a message when the user opted into it.
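
A sketch of the suggested change, with a hypothetical helper name; failures become warnings that callers can capture or filter, and success is only printed when the caller opts in:

```python
import warnings


def report_migration(db_path, error=None, verbose=False):
    """Surface migration status without writing to stdout unconditionally."""
    if error is not None:
        # Failures are warnings: visible by default, but capturable/filterable
        warnings.warn(f"Failed to migrate CSV cache: {error}", RuntimeWarning)
    elif verbose:
        # Success is informational; only emit it when explicitly requested
        print(f"Migrated citation cache from CSV to SQLite: {db_path}")
```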

Comment on lines 108 to 114
# Fetch from cache
conn = sqlite3.connect(self.citation_cache_db)
cursor = conn.cursor()

# Use placeholders for the IN clause
placeholders = ",".join("?" * len(doi))

Copilot AI Feb 22, 2026

placeholders = ",".join("?" * len(doi)) will produce an empty IN () clause when self.doi is an empty sequence, causing a SQL syntax error. Consider early-returning when doi is empty (e.g., return ""/None depending on print_result) before building the SQL query.
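
A small guard along these lines avoids the empty `IN ()` clause (the function name is hypothetical; table and column names are assumptions from the diff):

```python
def build_citation_query(dois):
    """Return (sql, params) for the cache lookup, or None for an empty
    DOI list, since "doi IN ()" is a SQLite syntax error."""
    if not dois:
        return None
    placeholders = ",".join("?" * len(dois))
    return (
        f"SELECT citation FROM citations WHERE doi IN ({placeholders})",
        list(dois),
    )
```

The caller then early-returns `""` or `None` (depending on `print_result`) when the builder yields `None`.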

- Changed backend storage from CSV to SQLite for improved performance and concurrency support.
- Implemented automatic backward-compatible migration: existing CSV caches will be converted to SQLite on first use.
- Updated 'get_citation' and 'update_cache' to use SQL queries for efficient data retrieval.
- Updated tests to verify SQLite cache functionality and migration logic.
@wenh06 wenh06 force-pushed the feature/sqlite-cache branch from 9b43477 to 716cc4a on February 22, 2026 08:47
wenh06 and others added 2 commits February 22, 2026 17:10
- Replace print with warnings.warn for migration status
- Handle empty DOI sequence to avoid invalid SQL syntax
- Chunk DOI list in SQL queries to avoid SQLite parameter limits
- Update tests to check for warnings instead of stdout output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wenh06 wenh06 disabled auto-merge February 22, 2026 13:38
@wenh06 wenh06 merged commit 34dad0a into master Feb 22, 2026
10 checks passed
@wenh06 wenh06 deleted the feature/sqlite-cache branch February 22, 2026 13:38